The mean normal body temperature was held to be 37$^{\circ}$C or 98.6$^{\circ}$F for more than 120 years since it was first conceptualized and reported by Carl Wunderlich in a famous 1868 book. In 1992, this value was revised to 36.8$^{\circ}$C or 98.2$^{\circ}$F.
In this exercise, you will analyze a dataset of human body temperatures and employ the concepts of hypothesis testing, confidence intervals, and statistical significance.
Answer the following questions in this notebook and submit it to your GitHub account.
You can include written notes in notebook cells using Markdown:
In [1]:
import pandas as pd
In [2]:
df = pd.read_csv('data/human_body_temperature.csv')
In [3]:
df.info()
In [5]:
df.head()
Out[5]:
In [4]:
df['temperature'].hist()
Out[4]:
No, this sample isn't normal; it is definitely skewed. However, "this is a condition for the CLT... to apply" is just wrong. The whole power of the CLT is that the distribution of sample means (not the distribution of the sample itself) tends to a normal distribution as the sample size grows, regardless of the shape of the population distribution, provided its variance is finite. What we do care about for the CLT is that our observations are independent, which, assuming the data was gathered in a traditional manner, should be the case.
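To make the CLT point concrete, here is a minimal simulation sketch. It does not use the temperature data: the exponential "population", the seed, and the number of repeats are all assumptions chosen purely for illustration. It draws many samples of size 130 from a strongly skewed distribution and compares the skewness of the individual draws with the skewness of the sample means.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical skewed "population": exponential with mean 1.
sample_size = 130   # same size as the temperature sample
n_repeats = 5000    # number of simulated samples (arbitrary choice)

samples = rng.exponential(scale=1.0, size=(n_repeats, sample_size))
sample_means = samples.mean(axis=1)


def skewness(x):
    """Sample skewness: the third standardized moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5


raw_skew = skewness(samples.ravel())
means_skew = skewness(sample_means)
print(f"skewness of individual draws: {raw_skew:.2f}")
print(f"skewness of sample means:     {means_skew:.2f}")
```

The individual draws stay heavily skewed, while the sample means are much closer to symmetric, which is the behaviour the CLT describes.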
In [17]:
m=df['temperature'].mean()
m
Out[17]:
With 130 data points, it really doesn't matter whether we use the normal or the t distribution. A t distribution with 129 degrees of freedom is essentially a normal distribution, so the results should not be very different. However, in this day and age I don't see the purpose of even bothering with the normal distribution. Looking up t distribution tables is awfully annoying, so the normal approximation once had a purpose; nowadays I'm just going to let a computer calculate either for me, and both are equally simple.
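As a quick check of how close the two distributions are, the sketch below compares critical values from `scipy.stats.t` with 129 degrees of freedom against `scipy.stats.norm`. The list of probabilities is an arbitrary choice for illustration.

```python
from scipy.stats import norm, t

dof = 129  # 130 observations minus one

for p in (0.90, 0.95, 0.975, 0.995):
    t_crit = t.ppf(p, dof)   # critical value from the t distribution
    z_crit = norm.ppf(p)     # critical value from the standard normal
    print(f"p={p}: t={t_crit:.4f}  z={z_crit:.4f}  diff={t_crit - z_crit:.4f}")

# Gap between the two critical values at the 97.5th percentile.
gap_975 = t.ppf(0.975, dof) - norm.ppf(0.975)
```

The t critical values are always slightly larger, but at this sample size the gap only shows up in the second decimal place.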
In [18]:
from scipy.stats import t, norm
from math import sqrt
In [39]:
patients=df.shape[0]  # number of observations in the sample
n=patients-1  # degrees of freedom for the t distribution
In [38]:
patients
Out[38]:
In [40]:
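# Standard error of the mean: divide by the square root of the sample size, not the degrees of freedom
SE=df['temperature'].std()/sqrt(patients)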
SE
Out[40]:
Our null hypothesis is that the true average body temperature is $98.6^\circ F$. We'll calculate the probability of finding a sample mean less than or equal to the one we obtained from this data, given that the null hypothesis is true; i.e. our alternative hypothesis is that the true average body temperature is less than $98.6^\circ F$.
In [20]:
t.cdf((m-98.6)/SE,n)
Out[20]:
In [19]:
norm.cdf((m-98.6)/SE)
Out[19]:
Regardless of which distribution we assume the sample mean follows, the probability of seeing a sample mean this low or lower if the true average body temperature were 98.6 is basically zero.
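The one-sided test above can be sketched as a small function. This is an illustration under stated assumptions, not the notebook's own cells: `one_sided_t_pvalue` is a hypothetical helper, and the readings are made up with a random number generator rather than read from the CSV. The result is cross-checked against `scipy.stats.ttest_1samp`, whose default p-value is two-sided.

```python
from math import sqrt

import numpy as np
from scipy.stats import t, ttest_1samp


def one_sided_t_pvalue(sample, mu0):
    """Return (t statistic, P(T <= t)) for H0: mean == mu0 vs H1: mean < mu0."""
    sample = np.asarray(sample, dtype=float)
    size = sample.size
    se = sample.std(ddof=1) / sqrt(size)  # standard error of the mean
    t_obs = (sample.mean() - mu0) / se
    return t_obs, t.cdf(t_obs, size - 1)


# Made-up readings for illustration only; not the notebook's data.
rng = np.random.default_rng(1)
fake_temps = rng.normal(loc=98.2, scale=0.7, size=130)

t_obs, p_less = one_sided_t_pvalue(fake_temps, 98.6)
check = ttest_1samp(fake_temps, 98.6)  # two-sided by default
print(t_obs, p_less, check.pvalue / 2)
```

When the observed statistic is negative, the one-sided p-value is half of the two-sided one, so the last two printed numbers should agree.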
In [21]:
# 5% in each tail, so these are the bounds of a 90% two-sided confidence interval
print(m+t.ppf(0.95,n)*SE)
print(m-t.ppf(0.95,n)*SE)
In [22]:
t.ppf(0.95,n)*SE
Out[22]:
Our estimate of the true average human body temperature is thus $98.2 \pm 0.1^\circ F$ (a 90% confidence interval).
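The interval above comes from `t.ppf(0.95, ...)`, which leaves 5% in each tail and therefore corresponds to a 90% two-sided interval. The sketch below wraps that calculation in a hypothetical helper, `mean_confidence_interval`, and runs it on made-up readings (not the notebook's data) to show how the width changes with the confidence level.

```python
from math import sqrt

import numpy as np
from scipy.stats import t


def mean_confidence_interval(sample, confidence=0.90):
    """Two-sided t-based confidence interval for the mean of `sample`."""
    sample = np.asarray(sample, dtype=float)
    size = sample.size
    mean = sample.mean()
    se = sample.std(ddof=1) / sqrt(size)  # standard error of the mean
    # Put (1 - confidence) / 2 of the probability in each tail.
    t_crit = t.ppf(1 - (1 - confidence) / 2, size - 1)
    margin = t_crit * se
    return mean - margin, mean + margin


# Made-up readings for illustration only; not the notebook's data.
rng = np.random.default_rng(2)
fake_temps = rng.normal(loc=98.2, scale=0.7, size=130)

low90, high90 = mean_confidence_interval(fake_temps, 0.90)
low95, high95 = mean_confidence_interval(fake_temps, 0.95)
print(f"90% CI: {low90:.2f} to {high90:.2f}")
print(f"95% CI: {low95:.2f} to {high95:.2f}")
```

Asking for more confidence widens the interval, so the 95% interval always contains the 90% one.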
This confidence interval, however, does not answer the question 'At what temperature should we consider someone's temperature to be "abnormal"?'. We can look at the distribution of the sample and see right away that the majority of our test subjects would be considered abnormal if we used this interval, which makes no sense.
The confidence interval only says something about what we can expect of sample means, not about individual values. Unfortunately, we would not expect sample percentiles to follow a normal distribution, so I, at least, am not currently equipped to do confidence intervals or hypothesis tests on them. However, I can report the percentiles themselves, which should give a good estimate of what should be considered normal, even though I can't say how confident we can be in these values.
In [50]:
df['temperature'].quantile([.1,.9])
Out[50]:
This range, 97.29-99.10 degrees F, includes 80% of the patients in our sample.
This shows the dramatic difference between the distribution of individual temperatures and the sampling distribution of the mean: from the confidence interval, we found that 90% of sample means would be expected to fall within a $\pm 0.1^\circ$ range, while the individual temperatures need a $\pm 0.9^\circ$ range to cover a smaller percentage (80%) of the sample.
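The contrast between the two spreads can be reproduced with a short simulation. Everything here is an assumption made for illustration: the readings are drawn from a normal distribution with a mean of 98.2 and a standard deviation of 0.7, not taken from the CSV. The sketch measures the width of the central 80% of individual readings and the width of the central 90% of simulated sample means.

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up readings for illustration only; not the notebook's data.
fake_temps = rng.normal(loc=98.2, scale=0.7, size=130)

# Spread of individual readings: central 80% of one sample.
q10, q90 = np.quantile(fake_temps, [0.1, 0.9])
individual_width = q90 - q10

# Spread of the sample mean: redraw many samples of the same size
# from the same made-up population and take the mean of each.
means = rng.normal(loc=98.2, scale=0.7, size=(5000, 130)).mean(axis=1)
m05, m95 = np.quantile(means, [0.05, 0.95])
mean_width = m95 - m05

print(f"central 80% of individuals spans  {individual_width:.2f} degrees")
print(f"central 90% of sample means spans {mean_width:.2f} degrees")
```

The means cluster far more tightly than the individual readings because the standard error shrinks with the square root of the sample size.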
In [32]:
males=df[df['gender']=='M']
males.describe()
Out[32]:
In [31]:
females=df[df['gender']=='F']
females.describe()
Out[31]:
In [62]:
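# Standard error of the difference in means: add the variances (not the standard deviations) divided by each group size.
# Note that this changes the p-value relative to the 6.5% reported below.
SEgender=sqrt(females['temperature'].var()/females.shape[0]+males['temperature'].var()/males.shape[0])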
SEgender
Out[62]:
In [61]:
mgender=females['temperature'].mean()-males['temperature'].mean()
mgender
Out[61]:
In [63]:
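# Two-sided p-value; degrees of freedom use the conservative choice min(n_f, n_m) - 1
2*(1-t.cdf(abs(mgender/SEgender),min(females.shape[0],males.shape[0])-1))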
Out[63]:
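With the calculation as originally run, the probability of seeing a difference at least this large if our null hypothesis (that there is no gender difference) is true came out at 6.5%, just above the 5% threshold, so on that figure we can't reject the null hypothesis. The margin is narrow, so the conclusion depends on the standard error and degrees of freedom used and should be rechecked if either changes.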
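One common way to run this kind of two-group comparison is Welch's unequal-variance t-test. That is a swapped-in technique offered as a sketch, not necessarily the method intended here. The helper `welch_t_test`, the group sizes, and the made-up readings are all assumptions for illustration. The manual calculation is cross-checked against `scipy.stats.ttest_ind` with `equal_var=False`.

```python
from math import sqrt

import numpy as np
from scipy.stats import t, ttest_ind


def welch_t_test(a, b):
    """Return (t statistic, Welch degrees of freedom, two-sided p-value)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Per-group variance of the mean: s^2 / n (variances, not standard deviations).
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    se = sqrt(va + vb)
    t_obs = (a.mean() - b.mean()) / se
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (va + vb) ** 2 / (va**2 / (a.size - 1) + vb**2 / (b.size - 1))
    p_two_sided = 2 * t.sf(abs(t_obs), dof)
    return t_obs, dof, p_two_sided


# Made-up groups for illustration only; sizes and parameters are hypothetical.
rng = np.random.default_rng(4)
group_a = rng.normal(98.4, 0.75, size=65)
group_b = rng.normal(98.1, 0.70, size=65)

t_obs, dof, p_value = welch_t_test(group_a, group_b)
ref = ttest_ind(group_a, group_b, equal_var=False)
print(t_obs, dof, p_value)
print(ref.statistic, ref.pvalue)
```

The Welch degrees of freedom always lie between the smaller group size minus one and the pooled value of the two group sizes minus two, which is why a fixed value outside that range is worth double-checking.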
In [ ]: